Combined Celery+Julia Pods and Cron-job rolling restarts #655
@GUI let me know if you have thoughts on the TODO for Jenkinsfile-restart-celery-julia.yaml, mentioned in the PR description.
This will log requests hitting the Julia HTTP server, making it a little more obvious what's happening in the logs.
- Add some missing variables needed even for this basic restart task.
- Wait for rollout restarts to complete so we know if they've been successful or not.
GUI left a comment
@Bill-Becker: I haven't analyzed the performance of things after this change, but I think the basic change to have a 1:1 relationship between the Celery workers and the Julia pods looks good. I'm still not sure this will totally solve the performance issues you've seen, but it will hopefully at least alleviate the potential imbalance of load on Julia containers given how the queuing currently works.
Regarding the restart Jenkins task, I've added configuration for that (https://github.nrel.gov/TADA/tada-jenkins-config/pull/22), so you should now find a "restart-celery-julia" job in Jenkins. I've updated the Jenkinsfile in this branch to what I believe will be a functional version of what you were after. I was able to run it successfully against this branch, but I believe once it lands on master, then the cron-job style should kick in.
More generally, there might be more Kubernetes-native ways to accomplish this type of restart for misbehaving pods that could be more resilient to various issues. For example, Kubernetes health checks and memory limits can be configured so that pods restart automatically once they exceed a memory threshold and/or are detected as unhealthy. With the latest Redis issue this past week, where things stopped working at a specific time, you'd maybe have to wait up to a day for this scheduled task to kick in and restore functionality; if you had Kubernetes health checks configured on the pods, Kubernetes could restart them as soon as it detects a failure.

That said, it obviously takes more work to implement this type of accurate health check, and all of these approaches are still sort of bandaids on whatever the underlying issues are. I think you've maybe explored memory limits before, but I know all of this has been particularly funky, so I'm definitely not familiar enough with the ins and outs of this application to really know what's going on. But if these scheduled restarts can help, then hopefully the job is at least set up now in Jenkins to execute them.
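To illustrate the health-check and memory-limit idea, a pod spec fragment might look roughly like the sketch below. This is purely hypothetical: the container name, port, health endpoint, and thresholds are assumptions, not values from this repo's manifests.

```yaml
# Hypothetical sketch of a liveness probe plus a memory limit on a container.
# Names, paths, and numbers are placeholders, not this repo's actual config.
containers:
  - name: julia-api            # hypothetical container name
    resources:
      limits:
        memory: "4Gi"          # container is OOM-killed and restarted above this
    livenessProbe:
      httpGet:
        path: /health          # assumes the Julia HTTP server exposes a health endpoint
        port: 8081             # assumed port
      initialDelaySeconds: 60  # give Julia time to start before probing
      periodSeconds: 30
      failureThreshold: 3      # restart after ~90s of consecutive failures
```

The hard part, as noted above, is making the probed endpoint actually reflect the failure modes you've been seeing (e.g. the stuck-Redis case), not just whether the process is up.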
This PR addresses the Kubernetes server issues as follows:
- Combine Celery+Julia containers together in one pod, while leaving separate Julia-only pods for non-Celery Julia API calls
- Tested on a staging API deploy of this branch: POST requests to the /job endpoint only go to these pods (not the Julia-only pods), and the Julia-only pods get all the non-Celery API requests. You can see the Julia container logs of a Celery+Julia pod only if you click on the pod and then view the Julia logs. Kubernetes seems to balance Celery jobs OK, but sometimes stacks multiple consecutive requests on the same pod even when the other pod has no Celery jobs running.
- Cron-job rolling restarts of Julia containers
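Besides the Jenkins cron-style job, a rolling restart like this could also be expressed directly as a Kubernetes CronJob. The sketch below is a hypothetical alternative, not code from this PR; the schedule, names, image, and service account are all assumptions, and the service account would need RBAC permission to patch deployments.

```yaml
# Hypothetical sketch: a CronJob that rolling-restarts a deployment on a schedule.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: restart-celery-julia            # hypothetical name
spec:
  schedule: "0 9 * * *"                 # assumed daily schedule
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: deployment-restarter  # assumed; needs RBAC to patch deployments
          restartPolicy: Never
          containers:
            - name: kubectl
              image: bitnami/kubectl    # any image with kubectl available
              command:
                - /bin/sh
                - -c
                # Restart, then wait on the rollout so the job fails visibly
                # if the new pods never become ready.
                - kubectl rollout restart deployment/celery-julia &&
                  kubectl rollout status deployment/celery-julia --timeout=10m
```

The `rollout status` step mirrors the "wait for rollout restarts to complete" change in this branch's Jenkinsfile.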
Also:
- Update production and staging resources
- Align number of gunicorn workers with max Django pod CPUs
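For aligning gunicorn workers with pod CPUs, one common heuristic (from the Gunicorn documentation) is `(2 × CPUs) + 1` workers. A minimal sketch, assuming the worker count is derived from the Django pod's CPU limit rather than the node's CPU count (which is what `multiprocessing.cpu_count()` would report inside Kubernetes):

```python
def gunicorn_workers(pod_cpu_limit: int) -> int:
    """Worker count via Gunicorn's (2 * CPUs) + 1 rule of thumb.

    pod_cpu_limit is assumed to be the Django pod's CPU limit from the
    Kubernetes resource spec, passed in explicitly because the host CPU
    count seen inside a container can exceed the pod's actual limit.
    """
    return 2 * pod_cpu_limit + 1


# A pod capped at 2 CPUs would run 5 gunicorn workers.
print(gunicorn_workers(2))
```

The exact count here is a judgment call per workload; the point of the change is just that the worker count should track the pod's CPU allocation instead of being a fixed number.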